This notebook inspects, loads and modifies some of the code from the ENTS paper's tar file. The aim is to produce a pickled object that returns features for given protein pairs, in the same way as was done for the Gene Ontology features.
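For reference, a minimal sketch of the intended usage, assuming the human.ENTS.features.pickle file name and the frozenset indexing used at the end of this notebook; the Entrez IDs shown are placeholders, not real data:
import pickle

# Load the pickled feature object (file name as used at the end of this notebook)
f = open("human.ENTS.features.pickle")
entsfeatures = pickle.load(f)
f.close()

# Index with a frozenset of two Entrez gene IDs to get a list of feature values
pair = frozenset(["12345", "67890"])  # placeholder IDs
try:
    print entsfeatures[pair]
except KeyError:
    print "no ENTS features available for this pair"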
The data is loaded by the code in the file run_genome_prob.py.
The code to do this is relatively short and has been loaded into the cell below.
In [1]:
cd ../../ents/standalone/
In [2]:
import pred_file_maker
from multiloc2 import MultiLoc2_highres
from DataFrame import DataFrame
from R_utilities import R_utilities as R
from optparse import OptionParser
from multiprocessing import Process
import os
import numpy as np
In [3]:
def get_n_unique_prots(subcell_file, prot_dict_file_name):
    subcell_info = MultiLoc2_highres(subcell_file)
    subcell_info = dict((k.split('.')[0],v) for k,v in subcell_info.gene_dict.iteritems())
    prot_dict_file = open(prot_dict_file_name)
    count = 0
    for line in prot_dict_file:
        line = line.strip().split('\t')
        if line[1] not in subcell_info: continue
        else: count += 1
    return count
In [4]:
# define file names
subcell_file = "h_sapiens_subcellular_multiloc2.out"
odds_file_name = "domain_odds.in"
domine_file_name = "domine_flat_file.in"
prot_dict_file_name = "H_sapiens_domains.out"
In [5]:
### Open the files ###
print "Reading input files..."
subcell_info = MultiLoc2_highres(subcell_file)
org_type = subcell_info.org_type
# Create dictionary of gene_name -> info for the subcellular localization
subcell_info = dict((k.split('.')[0],v) for k,v in subcell_info.gene_dict.iteritems())
odds_file = open(odds_file_name)
# Create dictionary of domain_pair -> odds_of_interaction_association
odds_dict = {}
for line in odds_file:
    line = line.strip().split(':')
    odds = float(line[1])
    pair = line[0].split(',')
    odds_dict[tuple(sorted(pair))] = odds
odds_file.close()
print "Done."
### Get essential protein information ###
print "Getting protein information..."
# Get all organism genes
proteins = []
# Dictionary from gene_stable_id -> protein_stable_id and the other way around
gene_protein_dict = {}
protein_gene_dict = {}
protein_domain_dict = {}
prot_dict_file = open(prot_dict_file_name)
for line in prot_dict_file:
    line = line.strip().split('\t')
    if line[1] not in subcell_info: continue
    proteins.append(line[0])
    gene_protein_dict[line[1]] = line[0]
    protein_gene_dict[line[0]] = line[1]
    if len(line) == 3:
        protein_domain_dict[line[0]] = line[2].split(' ')
    else: protein_domain_dict[line[0]] = []
prot_dict_file.close()
# Get domine interactions as dictionary of domine_pair -> confidence
domine_interactions = {}
domine = open(domine_file_name)
for line in domine:
    line = line.strip().split('\t')
    domine_interactions[tuple(sorted(line[0].split(',')))] = line[1]
print "Done."
The original code uses these parsed data files to write a file which is then passed as input to R through Rserve. Looking at this code, we can use the same data to create feature vectors for arbitrary protein pairs. The code that writes to the file is as follows:
for line in pair_file:
    line = line.strip().split('\t')
    # Get the domain part of the string
    protein1 = line[0]
    protein2 = line[1]
    if protein1 not in pair_prots: pair_prots.append(protein1)
    if protein2 not in pair_prots: pair_prots.append(protein2)
    try:
        domain_string = makeDomainString(protein_domain_dict[protein1], protein_domain_dict[protein2],\
                                         domine_interactions, odds_dict)
    except:
        key1 = [x for x in protein_domain_dict.keys() if protein1.startswith(x.split('.')[0])]
        key2 = [x for x in protein_domain_dict.keys() if protein2.startswith(x.split('.')[0])]
        if len(key1) == 0 or len(key2) == 0: continue
        else:
            domain_string = makeDomainString(protein_domain_dict[key1[0]], protein_domain_dict[key2[0]],\
                                             domine_interactions, odds_dict)
    # Get the subcellular localization part of the string
    try: subcell_string = makeSubcellularDict(protein1, protein2, subcell_info, protein_gene_dict)
    except KeyError:
        key1 = [x for x in protein_domain_dict.keys() if protein1.startswith(x.split('.')[0])]
        key2 = [x for x in protein_domain_dict.keys() if protein2.startswith(x.split('.')[0])]
        if len(key1) == 0 or len(key2) == 0: continue
        else:
            try: subcell_string = makeSubcellularDict(key1[0], key2[0], subcell_info, protein_gene_dict)
            except: continue
    out_file.write("%s\t%s\t%s\t%s\n" % (protein1,protein2,domain_string,subcell_string))
Taking this out of the loop, removing the file references and passing the protein names in to a function should make this work.
First we have to define a couple of the functions used above:
In [6]:
def makeDomainString(domains1, domains2, domine_dict, odds_dict):
    domain_dict = {}
    # Get all potential interactions
    if len(domains1) > 0 and len(domains2) > 0:
        potential_domain_pairs = getTwoListCombos(domains1,domains2)
    else: potential_domain_pairs = []
    # Get the domine information
    domine_pairs = list(set([tuple(sorted(x)) for x in potential_domain_pairs if tuple(sorted(x)) in domine_dict]))
    domain_dict['n_domine_pairs'] = len(domine_pairs)
    domain_dict['highest_domine_conf'] = '0'
    for pair in domine_pairs:
        if domine_dict[pair] == 'HC': domain_dict['highest_domine_conf'] = 'HC'
        elif domine_dict[pair] == 'MC' and domain_dict['highest_domine_conf'] != 'HC':
            domain_dict['highest_domine_conf'] = 'MC'
        elif domine_dict[pair] == 'LC' and domain_dict['highest_domine_conf'] not in ['HC','MC']:
            domain_dict['highest_domine_conf'] = 'LC'
    # Get the odds information
    ############## TEST ##################
    domain_dict['lowest_odds'] = 0.
    domain_dict['not_observed'] = 0
    domain_dict['not_observed_frac'] = 1.
    ################################
    domain_dict['sum_odds'] = 0.
    domain_dict['highest_odds'] = 0.
    domain_dict['n_odds_pairs'] = 0
    for pair in potential_domain_pairs:
        pair = tuple(list(sorted(pair)))
        # Check if in the odds dictionary. If so, update domain variables
        if pair in odds_dict:
            domain_dict['sum_odds'] += odds_dict[pair]
            if odds_dict[pair] > domain_dict['highest_odds']:
                domain_dict['highest_odds'] = odds_dict[pair]
            domain_dict['n_odds_pairs'] += 1
            ################### TEST ################
            if odds_dict[pair] < domain_dict['lowest_odds']: domain_dict['lowest_odds'] = odds_dict[pair]
        else:
            domain_dict['not_observed'] += 1
    if domain_dict['n_odds_pairs'] + domain_dict['not_observed'] > 0:
        domain_dict['not_observed_frac'] = float(domain_dict['not_observed']) / (domain_dict['n_odds_pairs'] + domain_dict['not_observed'])
    #########################################
    domain_dict = [str(domain_dict[k]) for k in domain_cols]
    return "\t".join(domain_dict)
In [7]:
def makeSubcellularDict(protein1, protein2, subcell_dict, protein_gene_dict = None):
    gene1 = protein_gene_dict[protein1]
    gene2 = protein_gene_dict[protein2]
    svm_line1 = [subcell_dict[gene1]['predictions'][k] for k in svm_subcell_cols]
    svm_line2 = [subcell_dict[gene2]['predictions'][k] for k in svm_subcell_cols]
    svm_line1 += [subcell_dict[gene1]['svm_info'][k] for k in svm_detail_cols]
    svm_line2 += [subcell_dict[gene2]['svm_info'][k] for k in svm_detail_cols]
    svm_line = svm_line1 + svm_line2
    return "\t".join([str(x) for x in svm_line])
In [8]:
def getfeaturevector(protein1, protein2, subcell_info, odds_dict, proteins, domine_interactions, \
                     protein_domain_dict, protein_gene_dict = None, verbose = True):
    """prototype function to retrieve ENTS feature vectors for a given protein pair"""
    keys = subcell_info.keys()
    # Configure svm columns for organism
    gene = keys[0]
    delete_svm_pred_cols = [x for x in svm_subcell_cols if x not in subcell_info[gene]['predictions'].keys()]
    delete_svm_detail_cols = [x for x in svm_detail_cols if x not in subcell_info[gene]['svm_info'].keys()]
    for col in delete_svm_pred_cols: svm_subcell_cols.remove(col)
    for col in delete_svm_detail_cols: svm_detail_cols.remove(col)
    # Get the domain part of the string
    try:
        domain_string = makeDomainString(protein_domain_dict[protein1], protein_domain_dict[protein2],\
                                         domine_interactions, odds_dict)
    except:
        key1 = [x for x in protein_domain_dict.keys() if protein1.startswith(x.split('.')[0])]
        key2 = [x for x in protein_domain_dict.keys() if protein2.startswith(x.split('.')[0])]
        if len(key1) == 0 or len(key2) == 0:
            return None
        else:
            domain_string = makeDomainString(protein_domain_dict[key1[0]], protein_domain_dict[key2[0]],\
                                             domine_interactions, odds_dict)
    # Get the subcellular localization part of the string
    try: subcell_string = makeSubcellularDict(protein1, protein2, subcell_info, protein_gene_dict)
    except KeyError:
        key1 = [x for x in protein_domain_dict.keys() if protein1.startswith(x.split('.')[0])]
        key2 = [x for x in protein_domain_dict.keys() if protein2.startswith(x.split('.')[0])]
        if len(key1) == 0 or len(key2) == 0:
            return None
        else:
            subcell_string = makeSubcellularDict(key1[0], key2[0], subcell_info, protein_gene_dict)
            #try: subcell_string = makeSubcellularDict(key1[0], key2[0], subcell_info, protein_gene_dict)
            #except:
            #    return None
    return domain_string, subcell_string
In [9]:
import itertools
In [10]:
from pred_file_maker import domain_cols, svm_subcell_cols, getTwoListCombos, svm_detail_cols
In [31]:
for pair in itertools.combinations(proteins,2):
    fvector = getfeaturevector(pair[0],pair[1],subcell_info,odds_dict,proteins,
                               domine_interactions,protein_domain_dict,protein_gene_dict)
    if fvector:
        print fvector
        break
So we can get feature vectors out of this for arbitrary protein pairs.
To return these feature vectors we require a class that integrates the above functions and stores the required data extracted from the files in this folder. This class can then be added to ocbio, as with Gene Ontology, and serve as a custom generator for the feature vectors. The class will be instantiated and pickled as in the Gene Ontology notebook.
The features should also be returned as a list rather than a tab-delimited string as above. Also, it will have to be able to deal with Entrez IDs rather than Ensembl IDs, so it will need to store a one-to-one dictionary for this internally.
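A minimal sketch of roughly what such a class could look like, reusing the module-level getfeaturevector function defined above and the constructor arguments used when ocbio.ents.ENTSfeatures is instantiated below. The ENTSfeaturesSketch name, the split on tabs and the choice of simply taking the first Ensembl ID mapped to each Entrez ID are illustrative assumptions, not the actual ocbio.ents implementation:
class ENTSfeaturesSketch(object):
    # Sketch only: wraps the functions above, maps Entrez IDs to Ensembl IDs
    # internally and returns each feature vector as a list of values.
    def __init__(self, subcell_info, odds_dict, proteins, domine_interactions,
                 protein_domain_dict, protein_gene_dict, entreztoensembl,
                 domain_cols, svm_subcell_cols, svm_detail_cols):
        self.subcell_info = subcell_info
        self.odds_dict = odds_dict
        self.proteins = proteins
        self.domine_interactions = domine_interactions
        self.protein_domain_dict = protein_domain_dict
        self.protein_gene_dict = protein_gene_dict
        self.entreztoensembl = entreztoensembl
        self.domain_cols = domain_cols
        self.svm_subcell_cols = svm_subcell_cols
        self.svm_detail_cols = svm_detail_cols

    def getfeaturevector(self, protein1, protein2):
        # reuse the module-level getfeaturevector defined above
        result = getfeaturevector(protein1, protein2, self.subcell_info,
                                  self.odds_dict, self.proteins,
                                  self.domine_interactions,
                                  self.protein_domain_dict,
                                  self.protein_gene_dict)
        if result is None:
            return None
        domain_string, subcell_string = result
        # return a flat list instead of two tab-delimited strings
        return domain_string.split("\t") + subcell_string.split("\t")

    def __getitem__(self, pair):
        # pair is a frozenset of two Entrez IDs; map each to an Ensembl ID,
        # taking the first mapped ID for simplicity in this sketch
        entrez1, entrez2 = tuple(pair)
        try:
            protein1 = self.entreztoensembl[entrez1][0]
            protein2 = self.entreztoensembl[entrez2][0]
        except KeyError:
            raise KeyError(pair)
        fvector = self.getfeaturevector(protein1, protein2)
        if fvector is None:
            raise KeyError(pair)
        return fvector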
In [12]:
import sys
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
In [13]:
import ocbio.ents
The dictionary used here is not one-to-one, and this must be taken into account in the code above. This dictionary is loaded from a pickled dictionary saved in the notebook on extracting InterologWalk features.
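As an illustration of one way to take this into account (not the actual ocbio.ents code), a hypothetical helper could try each candidate Ensembl ID mapped from an Entrez ID until one with domain data is found, assuming a dictionary like the gentreztoensembl dictionary loaded below, which maps each Entrez ID to a list of Ensembl IDs:
def resolve_ensembl_id(entrez_id, entreztoensembl, protein_domain_dict):
    # Hypothetical helper: entreztoensembl maps an Entrez ID to a list of
    # candidate Ensembl IDs; return the first candidate we have domain data for.
    for ensembl_id in entreztoensembl.get(entrez_id, []):
        if ensembl_id in protein_domain_dict:
            return ensembl_id
    return None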
In [14]:
import pickle
In [15]:
import csv
In [18]:
cd ../../geneconversion/
In [19]:
f = open("human.gene2ensemble.pickle")
gentreztoensembl = pickle.load(f)
f.close()
Unfortunately, this dictionary does not cover all of the proteins we would like to be able to map:
In [57]:
len(proteins)
Out[57]:
In [64]:
# flatten is assumed to be available in the session (it is not a builtin)
crossover = 0
for p in proteins:
    if p in flatten(gentreztoensembl.values()):
        crossover += 1
In [65]:
print crossover
In [66]:
print len(proteins)-crossover
So we improve this dictionary with a BioMart CSV, first writing a file of the protein IDs for the web conversion service.
In [69]:
f = open("ents.proteins.ensembl.txt","w")
csv.writer(f,delimiter="\n").writerow(proteins)
f.close()
Having downloaded the converted file biomart.ensembl2entrez.txt, we load it and add its entries to the dictionary:
In [77]:
f = open("biomart.ensembl2entrez.txt")
c = csv.reader(f,delimiter="\t")
c.next()
for line in c:
    try:
        gentreztoensembl[line[1]] += [line[0]]
    except KeyError:
        gentreztoensembl[line[1]] = [line[0]]
f.close()
Has that improved the situation?
In [78]:
crossover = 0
mappedproteins = list(flatten(gentreztoensembl.values()))
for p in proteins:
    if p in mappedproteins:
        crossover += 1
In [79]:
print crossover
In [80]:
print len(proteins)-crossover
In [21]:
cd ../ents/
In [22]:
reload(ocbio.ents)
Out[22]:
In [23]:
entsfeatures = ocbio.ents.ENTSfeatures(subcell_info,odds_dict,proteins,domine_interactions,
                                       protein_domain_dict,protein_gene_dict,gentreztoensembl,
                                       domain_cols,svm_subcell_cols,svm_detail_cols)
In [24]:
for pair in itertools.combinations(proteins,2):
    fvector = entsfeatures.getfeaturevector(pair[0],pair[1])
    if fvector:
        print fvector
        break
And we test calling it with Entrez ID pairs as frozensets:
In [25]:
ensembltoentrez = {}
for k in gentreztoensembl.keys():
    ensembltoentrez[gentreztoensembl[k][0]] = k
entrezlist = []
for p in proteins:
    try:
        entrezlist.append(ensembltoentrez[p])
    except:
        pass
In [26]:
for pair in itertools.combinations(entrezlist,2):
    try:
        fvector = entsfeatures[frozenset(pair)]
    except KeyError:
        continue
    if fvector:
        print fvector
        break
In [27]:
f = open("../DIP/human/training.nolabel.positive.Entrez.txt")
dippairs = list(map(lambda x: frozenset(x),csv.reader(f,delimiter="\t")))
f.close()
In [90]:
highest_domine_conf = []
for pair in dippairs:
    try:
        fvector = entsfeatures[pair]
        highest_domine_conf.append(fvector[7])
    except KeyError as inst:
        continue
In [30]:
print len(fvector)
The feature vectors produced contain some strings that are not explained in the documentation. In the above feature vector, for instance, there is the string 'HC' in the column that should be highest_domine_conf, which is documented as:
The highest confidence score for a pair of domains verified to interact in the DOMINE database.
The string will cause problems when training the classifier, as it cannot take a string as input. Other columns have also been observed to cause problems, as can be seen in older versions of the Classifier Training notebook.
To deal with the highest_domine_conf column we should first look at what the different values are:
In [84]:
print set(highest_domine_conf)
Assuming that those values mean high confidence ('HC'), medium confidence ('MC'), low confidence ('LC') and no DOMINE pair observed ('0'), these are categories and we can use 1-of-k encoding.
The ocbio.ents code will implement this transformation.
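As a minimal sketch of that 1-of-k encoding, assuming the four values observed above; the encode_domine_conf helper below is hypothetical, for illustration only, and is not the actual ocbio.ents code:
def encode_domine_conf(value):
    # Expand a highest_domine_conf value ('HC', 'MC', 'LC' or '0')
    # into four 0/1 indicator features, one per category.
    categories = ['HC', 'MC', 'LC', '0']
    return [1 if value == category else 0 for category in categories]

print encode_domine_conf('HC')   # [1, 0, 0, 0]
print encode_domine_conf('0')    # [0, 0, 0, 1]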
The debugging cells above in this notebook will no longer run as shown after this change.
In [99]:
reload(ocbio.ents)
Out[99]:
In [100]:
entsfeatures = ocbio.ents.ENTSfeatures(subcell_info,odds_dict,proteins,domine_interactions,
                                       protein_domain_dict,protein_gene_dict,gentreztoensembl,
                                       domain_cols,svm_subcell_cols,svm_detail_cols)
In [106]:
highest_domine_conf = []
for pair in dippairs:
    try:
        fvector = entsfeatures[pair]
        highest_domine_conf.append(fvector[7:11])
    except KeyError as inst:
        continue
In [107]:
highest_domine_conf = map(tuple, highest_domine_conf)
print set(highest_domine_conf)
In [116]:
import scipy.misc
In [118]:
total = scipy.misc.comb(len(gentreztoensembl.keys()),2)
In [131]:
stringdict = {}
lcount = 0
for pair in itertools.combinations(gentreztoensembl.keys(),2):
    try:
        fvector = entsfeatures[pair]
    except KeyError:
        pass
    # look in this vector for strings
    try:
        vecfloats = map(float,fvector)
    except:
        # must be a string in there
        # where is it?
        for i,x in enumerate(fvector):
            try:
                float(x)
            except:
                try:
                    stringdict[i] += [x]
                except KeyError:
                    stringdict[i] = [x]
    if lcount%int(total/1000) == 0:
        print lcount
    lcount += 1
The above loop had to be interrupted as it had already been running for over 12 hours. It did, however, test 62 million pairs without finding a single string, as shown below:
In [132]:
print stringdict.keys()
In [134]:
cd ../ents/
In [135]:
!git annex unlock human.ENTS.features.pickle
In [136]:
f = open("human.ENTS.features.pickle","wb")
pickle.dump(entsfeatures,f)
f.close()
In [137]:
f = open("human.ENTS.features.labels.txt","w")
csv.writer(f,delimiter="\n").writerow(domain_cols+svm_subcell_cols+svm_detail_cols)
f.close()